1 ABSTRACT

AskoR is a pipeline for the analysis of gene expression data, using edgeR. Several steps are performed: data filters (cpm method), normalize this filtered data, look at the correlation of our data, run differential expression analysis, compare contrast, GO enrichment and co-expression.

You’ll find a test set in the inst/extdata/input folder. It’ll be used for the rest of the documentation. Contents of input folder:

  • Count matrix file: CountsMatrix.txt OR Counts files per samples: files in “counts” directory (counts.tgz)
  • Samples file: Samples_CountsMatrix.txt OR Samples_CountsFiles.txt
  • Contrasts file: Contrasts.txt
  • Genes annotations file: Genes_annotations.txt (optional)
  • GO annotations file: GO_annotations.txt (optional)

IMPORTANT : All input files must be in a folder named input (case sensitive).

2 Input files description

2.1 Count files

2.1.1 Sample count files

You have a count file per sample, they can be in text or csv format. In this case, you will have to fill in the Samples file with the path and name of the count files for each sample in a “file” column. This one can contain several columns according to the counting tools used, you will have to inform the following parameters:

  • parameters$col_genes   column number with GeneId (default 1)
  • parameters$col_counts   column number with count (default 7)
  • parameters$sep   column separator (default “\t” )

Example of count file:

Geneid Chr Start End Strand Length Counts
Gene_000001 Random_Chr_001 1692 1907 - 215 0
Gene_000002 Random_Chr_001 6641 8705 - 2064 43.5
Gene_000003 Random_Chr_001 9228 9569 - 341 8
Gene_000004 Random_Chr_001 12009 13155 - 1146 781
Gene_000005 Random_Chr_001 15242 15844 + 602 16
Gene_000006 Random_Chr_001 16304 19834 + 3530 9
Gene_000007 Random_Chr_001 20595 21625 - 1030 13.83
Gene_000008 Random_Chr_001 22377 23461 - 1084 565.33

The corresponding parameters:

parameters$col_genes=1
parameters$col_counts=7
parameters$sep="\t"

In this example column 7 contains the counts and the gene identifiers are in the first column, it’s a tabulate file so the column separator is encoded “\t”.

2.1.2 Counts matrix file

It is also possible to have a table, tabulated file, grouping the counts for each gene in each sample, in text or csv format. The Samples file should not contain a file column, you will have to fill in the name of the count file: parameters$fileofcount.

Example of a count matrix, the column separator is a tabulation:

Geneid AC1R1 AC1R2 AC1R3 BC1R1 BC1R2 BC1R3
Gene_000001 0 1 0 0 0 1
Gene_000002 43.5 25.33 31.5 27.5 29.5 29
Gene_000003 8 4 5 30 16 13
Gene_000004 781 412 626 558 538 346
Gene_000005 16 7 13 9 8 6
Gene_000006 9 4 5 21 15 12

2.2 Samples file

This tabulated file describes the design of experiments. The first and second columns are mandatory and are named “sample” and “condition”. You may have several other columns. The contents of the condition column will be the same as in the Contrast file.

The column “color” is optional, it allows to predefined the color of the sample in the graphs. If it is absent askoR will assign colors itself.

The column “file” is mandatory if you have samples counts files. In the example below, these files are grouped in a “counts” folder. You do not need to specify the name of the “input” folder (i.e. input/counts/AC1R1_counts.txt) since, by default, it will search for it.

Don’t forget to fill in the name of your Samples file: parameters$sample_file, no need to specify the “input” folder.

Example of a Samples.txt file:

sample condition genotype treatment color file
AC1R1 AC1 A C1 darkorchid2 counts/AC1R1_counts.txt
AC1R2 AC1 A C1 darkorchid2 counts/AC1R2_counts.txt
AC1R3 AC1 A C1 darkorchid2 counts/AC1R3_counts.txt
BC1R1 BC1 B C1 saddlebrown counts/BC1R1_counts.txt
BC1R2 BC1 B C1 saddlebrown counts/BC1R2_counts.txt

2.3 Contrast file

This tabulated file indicates contrasts you wish to make between your different conditions. The first column corresponds to the condition column of the Samples file, then the others are columns the comparisons to be made in the form ConditionXvsConditionY. Then under these columns, ConditionX will be noted + and ConditionY will be noted -, the rest 0. You will have to fill in the name of your file: parameters$contrast_file

Example of contrasts file:

Condition AC1vsAC2 AC1vsAC3 AC2vsAC3 BC1vsBC2
AC1 + + 0 0
AC2 - 0 + 0
AC3 0 - - 0
BC1 0 0 0 +
BC2 0 0 0 -

2.4 Genes annotations file

This tabulated file contains the annotations of your genes, it is optional. It can contain several columns but the first one must be the gene identifier. You will have to fill in the name of your file: parameters$annotation.

Example of annotation file:

SeqName Description
Gene_000001 hypothetical protein pbra 009537
Gene_000002 hypothetical protein pbra 009324
Gene_000003 histone-lysine n-methyltransferase nsd2
Gene_000004 hypothetical protein pbra 009496

2.5 GO annotations file

This tabulated file will be WITHOUT HEADER, the first column contains the gene identifier and the second column contains all the corresponding GOs separated by a comma. This file is optional, you will have to fill in its name: parameters$geneID2GO_file. Cf. GO enrichment Section

Example of GOs annotation file:

   
Gene_000001     GO:0003676,GO:0015074
Gene_000002     GO:0003676,GO:0015074
Gene_000003     GO:0005488,GO:0006807,GO:0016740,GO:0043170,GO:0044238
Gene_000005     GO:0005525,GO:0005525,GO:0005525

3 Out files

All the generated files and images will be in a folder named by default “DE_analysis”, you can change this name: parameters$analysis_name.

4 Initialize and load data

Now that we have our input files, look at the script, you should have these first lines:

# Path to askoR file 
library(askoR)

# Sets defaults parameters
parameters<-Asko_start()

Don’t forget to r}place the paths, the first one to the askoR.R script and the second one to your working directory (containing the input folder).

Once this step has been completed, you will be able to indicate the names of the analysis files:

# output directory name (default DE_analysis)
parameters$analysis_name="DEG_test"

# input files:
# matrix of different contrasts desired
parameters$contrast_file = "Contrasts.txt"    
# file containing the functional annotations for each gene
parameters$annotation = "Genes_annotations.txt"      
# GO annotation files
parameters$geneID2GO_file = "GO_annotations.txt"   
  • If you use a counts matrix :
# matrix of count for all samples/conditions
parameters$fileofcount = "CountsMatrix.txt"  
# file describing all samples
parameters$sample_file = "Samples_CountsMatrix.txt"  
  • If you use files of counts :
# file describing all samples
parameters$sample_file = "Samples_CountsFiles.txt"
# column with the gene names (default 1)
parameters$col_genes = 1          
# column with the counts values (default 7)
parameters$col_counts = 7 
# field separator (default "\t")
parameters$sep = "\t" 

We are informed that two samples, “AC3R2” and “BC3R3”, had problems during the experiments, it is requested to extract it from our analysis. No need to redo all the files, just use the parameter: parameters$rm_sample. You can provide a list of samples “c(”sample1”,“sample2”,“sample3”,…), or a single sample c(“sample1”). In the same way, if you only want to work on a part of your samples, you can use parameters$select_sample.

# delete sample AC3R2 
parameters$rm_sample = c("AC3R2","BC3R3")

It’s time to load your data:

data<-loadData(parameters)
## 
## Created directories:
##   ../inst/extdata//DEG_test/ 
##   ../inst/extdata//DEG_test/DataExplore/ 
##   ../inst/extdata//DEG_test/NormCountsTables/ 
##   ../inst/extdata//DEG_test/DEanalysis/ 
##   ../inst/extdata//DEG_test/DEanalysis/DEimages/ 
##   ../inst/extdata//DEG_test/DEanalysis/DEtables/ 
##   ../inst/extdata//DEG_test/DEanalysis/AskoTables/ 
## 
## Samples:
##  [1] "AC1R1" "AC1R2" "AC1R3" "BC1R1" "BC1R2" "BC1R3" "AC2R1" "AC2R2" "AC2R3"
## [10] "BC2R1" "BC2R2" "BC2R3" "AC3R1" "AC3R3" "BC3R1" "BC3R2"
## 
## Conditions :
##  [1] AC1 AC1 AC1 BC1 BC1 BC1 AC2 AC2 AC2 BC2 BC2 BC2 AC3 AC3 BC3 BC3
## Levels: AC1 AC2 AC3 BC1 BC2 BC3
## 
## Contrasts:
##     AC1vsAC2 AC1vsAC3 AC2vsAC3 BC1vsBC2 BC1vsBC3 BC2vsBC3 AC1vsBC1 AC2vsBC2
## AC1        +        +        0        0        0        0        +        0
## AC2        -        0        +        0        0        0        0        +
## AC3        0        -        -        0        0        0        0        0
## BC1        0        0        0        +        +        0        -        0
## BC2        0        0        0        -        0        +        0        -
## BC3        0        0        0        0        -        -        0        0
##     AC3vsBC3
## AC1        0
## AC2        0
## AC3        +
## BC1        0
## BC2        0
## BC3        -

You can see that DataExplore folder has been created in DEG_test:

Then the samples and conditions that have been loaded are displayed. These have been loaded into a structure called “data”. Some commands to display your data:

# Displays all samples recorded
data$samples  
##       condition genotype treatment       color
## AC1R1       AC1        A        C1 darkorchid2
## AC1R2       AC1        A        C1 darkorchid2
## AC1R3       AC1        A        C1 darkorchid2
## BC1R1       BC1        B        C1 saddlebrown
## BC1R2       BC1        B        C1 saddlebrown
## BC1R3       BC1        B        C1 saddlebrown
## AC2R1       AC2        A        C2  indianred3
## AC2R2       AC2        A        C2  indianred3
## AC2R3       AC2        A        C2  indianred3
## BC2R1       BC2        B        C2      khaki2
## BC2R2       BC2        B        C2      khaki2
## BC2R3       BC2        B        C2      khaki2
## AC3R1       AC3        A        C3  palegreen4
## AC3R3       AC3        A        C3  palegreen4
## BC3R1       BC3        B        C3  steelblue3
## BC3R2       BC3        B        C3  steelblue3
# Displays all contrast recorded
data$contrast 
##     AC1vsAC2 AC1vsAC3 AC2vsAC3 BC1vsBC2 BC1vsBC3 BC2vsBC3 AC1vsBC1 AC2vsBC2
## AC1        1        1        0        0        0        0        1        0
## AC2       -1        0        1        0        0        0        0        1
## AC3        0       -1       -1        0        0        0        0        0
## BC1        0        0        0        1        1        0       -1        0
## BC2        0        0        0       -1        0        1        0       -1
## BC3        0        0        0        0       -1       -1        0        0
##     AC3vsBC3
## AC1        0
## AC2        0
## AC3        1
## BC1        0
## BC2        0
## BC3       -1
# Displays design experiment
data$design   
##       AC1 AC2 AC3 BC1 BC2 BC3
## AC1R1   1   0   0   0   0   0
## AC1R2   1   0   0   0   0   0
## AC1R3   1   0   0   0   0   0
## BC1R1   0   0   0   1   0   0
## BC1R2   0   0   0   1   0   0
## BC1R3   0   0   0   1   0   0
## AC2R1   0   1   0   0   0   0
## AC2R2   0   1   0   0   0   0
## AC2R3   0   1   0   0   0   0
## BC2R1   0   0   0   0   1   0
## BC2R2   0   0   0   0   1   0
## BC2R3   0   0   0   0   1   0
## AC3R1   0   0   1   0   0   0
## AC3R3   0   0   1   0   0   0
## BC3R1   0   0   0   0   0   1
## BC3R2   0   0   0   0   0   1
## attr(,"assign")
## [1] 1 1 1 1 1 1
## attr(,"contrasts")
## attr(,"contrasts")$Group
## [1] "contr.treatment"
# Displays the first 5 lines and 8 columns of counts table.
data$dge$counts[1:5,1:8] 
##             AC1R1  AC1R2 AC1R3 BC1R1 BC1R2 BC1R3 AC2R1 AC2R2
## Gene_000001   0.0   1.00   0.0   0.0   0.0     1   1.0   0.0
## Gene_000002  43.5  25.33  31.5  27.5  29.5    29  34.5  32.5
## Gene_000003   8.0   4.00   5.0  30.0  16.0    13  20.0  26.0
## Gene_000004 781.0 412.00 626.0 558.0 538.0   346 519.0 370.0
## Gene_000005  16.0   7.00  13.0   9.0   8.0     6  10.0  10.0
# Total number of genes:
dim(data$dge$counts)[1]
## [1] 12811
# Total number of samples:
dim(data$dge$counts)[2]
## [1] 16

The next step is to generate the files describing your experiences for Askomics. Even if you don’t plan to use Askomics, this command is mandatory because it generates a data structure “asko_data” that will be used in the further analysis.

asko_data<-asko3c(data, parameters)

5 Filtering data

For the filters the CPM method is used, you can set the cutoff values you want to:

# CPM's threshold 
parameters$threshold_cpm = 0.5  
# minimum of sample which are upper to cpm threshold 
parameters$replicate_cpm = 3 # we have 3 replicates
# run filtering 
asko_filt<-GEfilt(data, parameters)

# Total number of filtered genes: 
dim(asko_filt$counts)[1]
## [1] 10991

The filtered data is saved in a structure called here: asko_filt. In the folder DEG_test/DataExplore/, you should find the images representing your data before and after filtering.

Filtering DataFiltering DataFiltering DataFiltering Data

Figure 1: Filtering Data

Density GraphesDensity Graphes

Figure 2: Density Graphes

You notice that the legend of the density graphs is very low compared to the graph. You can correct this with the options parameters$densinset which modifies the position of the legend, it is also possible to define the number of columns with parameters$legendcol. Finally, restart “GEfilt” function.

# Set position the legend in bottom density graphe
parameters$densinset = 0.20
# Set numbers of column for legends
parameters$legendcol = 8
# run filtering
asko_filt<-GEfilt(data, parameters)
Density Graphes CorrectedDensity Graphes Corrected

Figure 3: Density Graphes Corrected

6 Normalize data

Once the filters have been made, we can proceed to the normalization of the data. At this step, you can generate file with normalize factor values for each sample parameters$norm_factor=TRUE and/or generate with normalize counts parameters$norm_counts=TRUE.

# run normalization
asko_norm<-GEnorm(asko_filt, asko_data, data, parameters)

Normalized data is saved in a structure called here : asko_norm. In the folder “DEG_test/DataExplore/”, you should find the images representing your data after normalization. Two files are automatically generated because they will be used for co-expression analysis: “DEG_test_CPMNormCounts.txt” and “DEG_test_CPM_NormMeanCounts.txt”.

Normalization graphs Normalization graphs

Figure 4: Normalization graphs

Normalization graphs (heatmap)Normalization graphs (heatmap)

Figure 5: Normalization graphs (heatmap)

7 Correlation

From the filtered and normalized data, we can re-correlate the correlation between our samples.

GEcorr(asko_norm,parameters)

Several graphics will be saved in the “DEG_test/DataExplore/” folder, including MDS and PCA plots. Axis1 vs axis2 differentiate our A and B samples.

MDS and PCA plots - axis 1 and 2MDS and PCA plots - axis 1 and 2

Figure 6: MDS and PCA plots - axis 1 and 2

8 DE analysis

The differential expression analysis can be started. We will play with the following parameters:

  • FDR threshold value
  • logFC threshold value
  • normalization method (TMM/RLE/upperquartile/none)
  • p-value adjust method (holm/hochberg/hommel/bonferroni/BH/BY/fdr/none)
  • GLM method (lrt/qlf)
# FDR threshold
parameters$threshold_FDR = 0.05
# logFC threshold
parameters$threshold_logFC = 0
# normalization method
parameters$normal_method = "TMM"
# p-value adjust method
parameters$p_adj_method = "BH"
# GLM method
parameters$glm = "lrt"

You can decide to get the Volcano or Mean-Difference Plots for each contrast:

# Mean-Difference Plot of Expression Data (aka MA plot)
parameters$plotMD = TRUE
# Volcano plot for a specified coefficient/contrast of a linear model
parameters$plotVO = TRUE

Once our parameters are defined, we can start the analysis.

# run differential expression analysis
resDEG<-DEanalysis(asko_norm, data, asko_data, parameters)
## 
## AC1<AC2 AC1=AC2 AC1>AC2 
##    1747    8358     886
## 
## AC1<AC3 AC1=AC3 AC1>AC3 
##    1323    9074     594
## 
## AC2=AC3 
##   10991
## 
## BC1<BC2 BC1=BC2 BC1>BC2 
##      45   10203     743
## 
## BC1=BC3 
##   10991
## 
## BC2<BC3 BC2=BC3 BC2>BC3 
##       1   10989       1
## 
## AC1<BC1 AC1=BC1 AC1>BC1 
##     635   10055     301
## 
## AC2<BC2 AC2=BC2 AC2>BC2 
##    2175    6599    2217
## 
## AC3<BC3 AC3=BC3 AC3>BC3 
##    1230    8126    1635
## 
## Create Summary file

For each contrast, you will find the number of over- or under-expressed genes. Genotype B does not show any major effects of the treatment, unlike genotype A. We also observe a certain number of differentially expressed genes between the genotypes.

This is summarized in a barplot :

DE numbers barplots

Figure 7: DE numbers barplots

A file named “Summary_DEresults.txt” is located in the DEG_test/DEanalysis/DEtables folder, which contains for each gene whether it is over-expressed (1) or under-expressed (-1) or neutral (0) for a given contrast. If you had provided an annotation file, these will be found in the last columns.

First lines of the :

AC1vsAC2 AC1vsAC3 AC2vsBC2 AC3vsBC3 Description
Gene_000002 0 0 0 0 hypothetical protei…
Gene_000003 -1 0 0 0 histone-lysine n-m…
Gene_000004 1 1 -1 0 hypothetical prote…

You’ll find in “DEG_test/DEanalysis/DEimages” directory, les Volcano, MD plots, Pvalue (raw and adjusted) graphs and heatmap for each contrast.

DEanalysis plotsDEanalysis plots

Figure 8: DEanalysis plots

DEanalysis plots

Figure 9: DEanalysis plots

Pvalue graphs

Figure 10: Pvalue graphs

9 Basic comparisons

You can compare your lists of differentially expressed genes using two methods: Venn diagrams or Upset graphs. Venn diagrams allow you to compare up to 4 lists while Upset allows you to make wider comparisons. However, if you have too many lists to display the graph may be unreadable.

9.1 Venn diagram

To display the Venn diagrams, you need to specify the type of comparison wanted parameters$VD:

  • “all” : Create VennDiagrams for all differentially expressed genes
  • “up” : Create VennDiagrams for gene expressed UP
  • “down” : Create VennDiagrams for gene expressed DOWN
  • “both” : Create VennDiagrams for gene expressed UP and DOWN (in the same graph)

Next, you must provide a list of the comparisons to display: parameters$compaVD. For exemple :

# this create 1 venn diagram
parameters$compaVD=c("Ctrast1-Ctrast2-Ctrast3") 
# this create 3 venn diagrams
parameters$compaVD=c("Ctrast1-Ctrast2-Ctrast3", 
                     "Ctrast4-Ctrast5-Ctrast6",
                     "Ctrast7-Ctrast8-Ctrast9")

Be careful, with the VD=“both” you will only have to provide two contrasts. Example:

# this create 1 venn diagram
parameters$compaVD=c("Ctrast1-Ctrast2") 
# this create 3 venn diagrams
parameters$compaVD=c("Ctrast1-Ctrast2", 
                     "Ctrast1-Ctrast3",
                     "Ctrast2-Ctrast3")

With our data, we will make 3 Venn diagrams for the different types (all, up and down).

parameters$compaVD = c("AC1vsAC2-AC1vsAC3-AC2vsAC3",
                       "BC1vsBC2-BC1vsBC3-BC2vsBC3",
                       "AC1vsBC1-AC2vsBC2-AC3vsBC3")


# graph type "all"
parameters$VD = "all"
VD(resDEG, parameters, asko_data)

# graph type "up"
parameters$VD = "up"
VD(resDEG, parameters, asko_data)

# graph type "down"
parameters$VD = "down"
VD(resDEG, parameters, asko_data)

To use the VD=“both” option, we can provide list of two contrasts.

# graph type "both"
parameters$compaVD = c("AC1vsBC1-AC2vsBC2",
                       "AC1vsBC1-AC3vsBC3",
                       "AC2vsBC2-AC3vsBC3")
parameters$VD = "both"
VD(resDEG, parameters, asko_data)

All graphs will appear in a folder named “DEG_test/VennDiagrams/”. Some example of venn diagrams :

Venn DiagramsVenn DiagramsVenn DiagramsVenn Diagrams

Figure 11: Venn Diagrams

9.2 Upset graphs

You can display all contrast, you just need to specify the type of comparison wanted parameters$upset_basic: - “all” : Create chart for all differentially expressed genes - “up” : Create chart for gene expressed UP - “down” : Create chart for gene expressed DOWN - “mixed” : Create chart for gene expressed UP and DOWN (in the same graph) - NULL : Don’t make graphs You can display multiples graphs based on list of contrast parameters$upset_list, you need to precise the type of comparison parameters$upset_type. Example:

# Precise type of comparison: all, down, up, mixed.
parameters$upset_type = "all"

# Give a list of contrast, for example:
# this create 1 graphs
parameters$upset_list = c("Ctrast1-Ctrast2-Ctrast3")   
# this create 3 graphs
parameters$upset_list = c("Ctrast1-Ctrast2-Ctrast3",  
                          "Ctrast4-Ctrast5-Ctrast6",
                          "Ctrast1-Ctrast2-Ctrast3-Ctrast4-Ctrast5")

With our data, we will make several upset charts for the different types (all, up, down and mixed), with all contrast and list of contrast.

parameters$upset_list = c("AC1vsAC2-AC1vsAC3-AC2vsAC3",
                          "BC1vsBC2-BC1vsBC3-BC2vsBC3",
                          "AC1vsBC1-AC2vsBC2-AC3vsBC3")

# graphs type "all"
parameters$upset_basic = "all" # all contrast
parameters$upset_type = "all"  # list of contrast
UpSetGraph(resDEG, data, parameters)

# graphs type "mixed"
parameters$upset_basic = "mixed" # all contrast
parameters$upset_type = "mixed"  # list of contrast
UpSetGraph(resDEG, data, parameters)

# graphs type "up"
parameters$upset_basic = "up" # all contrast
parameters$upset_type = "up"  # list of contrast
UpSetGraph(resDEG, data, parameters)

# graphs type "down"
parameters$upset_basic = "down" # all contrast
parameters$upset_type = "down"  # list of contrast
UpSetGraph(resDEG, data, parameters)

An “DEG_test/UpsetGraphs/” directory will be created with two subdirectories “DEG_test/ UpSetR_graphs/Global_upset/” and “DEG_test/UpsetGraphs/Subset_upset/”.
Some example of upset graphs (from subset “AC1vsBC1-AC2vsBC2-AC3vsBC3”):

UpsetR GraphsUpsetR GraphsUpsetR GraphsUpsetR Graphs

Figure 12: UpsetR Graphs

10 GO Enrichment Analysis

We uses the GOs annotations file to perform enrichment analysis on differentially expressed gene. For this, you define :

  • parameters$GO_threshold  the significant threshold used to filter p-values
  • parameters$GO_max_top_terms  the maximum number of GO terms plot
  • parameters$GO_min_num_genes  the minimum number of genes for each GO terms
  • parameters$GO  gene set chosen for analysis ‘up’, ‘down’, ‘both’ (up+down)
  • parameters$GO_algo  algorithms for runTest function (“classic”, “elim”, “weight”, “weight01”, “lea”, “parentchild”)
  • parameters$GO_stats  statistical tests for runTest function (“fisher”, “ks”, “t”, “globaltest”, “sum”, “ks.ties”)
  • parameters$Ratio_threshold  the min ratio for display GO in graph

After that, we can run Go enrichment analysis:

# Parameters
parameters$GO_threshold = 0.05
parameters$GO_max_top_terms = 10
parameters$GO_min_num_genes = 10
parameters$GO = "both"
parameters$GO_algo = "weight01"
parameters$GO_stats = "fisher"
parameters$Ratio_threshold = 1

# run analysis
GOenrichment(resDEG, data, parameters)

A “DEG_test/GOenrichment/” directory will be created with all GO images and tables of statistics. Example of graph:

By changing parameters$GO to “up” or “down”, you can execute GO-term enrichment on UP-regulated genes or DOWN-regulated genes separately.

GO Enrichment graphsGO Enrichment graphs

Figure 13: GO Enrichment graphs

Example of one statistical table:

GO.ID Term Annotated Significant Expected statisticTest Ratio GO_cat
GO:0003735 structural const… 135 36 10.81 4.6e-11 3.3302497687 MF
GO:0000155 phosphorelay sen… 22 7 1.76 0.0012 3.9772727272 MF
GO:0003729 mRNA binding 13 5 1.04 0.0024 4.8076923076 MF
GO:0036094 small molecule b… 1327 105 106.24 0.0038 0.9883283132 MF

Explications of some columns:

  • Annotated: number of genes in your genome annotated with GO-terms.
  • Significant: number of genes belonging to your input which are annotated with the GO-term.
  • Expected: show an estimate of the number of genes a node of size Annotated would have if the significant genes were to be randomly selected from the gene universe.
  • statisticTest: result of fisher test

Finally, for each analysis, a “NameOfTheContrast_SignificantGO” directory is created in which you can find, for each enriched GO-term, the genes that enabled enrichment.

11 Co-Expression Analysis

11.1 Gene clustering

ClustAndGO function is based on “coseq” package for gene clustering and enables to highlight DE genes (in at least 1 contrast) that have the same expression profile. Genes profiles are clustered using adapted transformations and mixture models or K-means algorithm. A model selection criteria is proposed to choose an appropriate number of clusters.

Through ClustAndGO, the list of the genes of each cluster is extracted as a table and graphical outputs are produced to visualize cluster profiles. The function provided in AskoR also allows automatically to describe each cluster (gene expression, gene lists shared with several contrasts, GO-enrichmment).

First, you have to define some parameters for the analysis. Here are the parameters to define for gene clustering analysis:

  • parameters$GO_threshold  the significant threshold used to filter p-values

  • parameters$coseq_data  the type of data you want to cluster

    • “LogScaledData”: log2(data+1); for this you have to put parameters$coseq_transformation=“none”
    • “ExpressionProfiles”: sample expression / sum of expression in all samples - choosen default approach by coseq creators
  • parameters$coseq_model  the algorythm for the clustering ‘kmeans’, ‘Normal’ (gaussian mixture model).

  • parameters$coseq_transformation  the transformation applied to the expression profiles for the clustering - sample expression / sum of expression in all samples, (“voom”, “logRPKM”, “arcsin”, “logit”, “logMedianRef”, “logclr”, “clr”, “alr”, “ilr”, “none”).

  • parameters$coseq_ClustersNb  the number of clusters to be build

    • a fixed number from 2 to 25, for example: parameters$coseq_ClustersNb=4
    • (by default) fixed range 2:25 (parameters$coseq_ClustersNb=2:25). Coseq choose a number chosen between 2 to 25 clusters based on the Integrated Completed Likelihood (ICL) criterion (in the case of the Gaussian mixture model) or on the slope heuristics (in the case of the K-means algorithm)
  • parameters$coseq_HeatmapOrderSample  choose “TRUE” if you don’t want the samples to be clustered and prefer to keep the initial sample order.

Note: If a GO annotation file has been provided by the user, don’t forget also to define parameters for GO enrichment analysis in each cluster (Cf. GO enrichment Section for parameters).

You can then run clustering analysis to highlight co-expression profiles:

# Parameters for gene clustering
parameters$coseq_data = "ExpressionProfiles"
parameters$coseq_model = "kmeans"     
parameters$coseq_transformation = "clr"
parameters$coseq_ClustersNb = 4   
parameters$coseq_HeatmapOrderSample = FALSE

# Parameters for GO enrichment 
parameters$GO_threshold = 0.05
parameters$GO_min_num_genes = 10
parameters$GO_algo = "weight01"
parameters$GO_stats = "fisher"

# Parameters for GO enrichment graphs
parameters$GO_max_top_terms = 10
parameters$GO_min_sig_genes = 2
parameters$Ratio_threshold = 2

# run analysis
clust<-ClustAndGO(asko_norm,resDEG,parameters, data)

A directory “DEG_test/Clustering/OnDEgenes/” is created, in which you will find one directory per analysis (the name will depend on the model, the transformation, and the number of clusters chosen).

Recommendations

  • Coseq is designed to clusterize expression profiles (sample expression / sum of expression in all samples) with one of the transformations available but some users may want to clusterize directly the scaled log+1 expression. In that case, set parameters$coseq_data="LogScaledData" &nbsp and parameters$coseq_transformation="none"

  • “K-means” model is widely used and quite fast but it is a particular case of a gaussian mixture model where samples are suposed to be independant (correlation between samples = 0 and same variance). If you want to use a more flexible model, you set the parameters$coseq_model=“Normal” to apply a global gaussian mixture model.

  • When using “K-means” model, the “clr” transformation is recommended by coseq in many cases but, if you are trying to identify very specific clusters (for example tissue specific clusters) you can test the “logclr” transformation.

  • When using “Normal” model, “arcsin” or “logit” transformations should be tested at the beginning.

  • Analyze your results of GO-enrichment very carefully if you have less than 100 genes in the cluster.

11.2 Global outputs

Several files (tables and plots) are saved in the directory of the analysis and one directory per cluster is created (which contains detailed files for each cluster).

11.2.1 Tables

clusters.coexpr. AC1 AC2 AC3 BC1 BC2 BC3 AC1vsAC2 AC1vsAC3 AC2vsAC3 BC1vsBC2 BC1vsBC3 BC2vsBC3 AC1vsBC1 AC2vsBC2 AC3vsBC3 Description
Gene_000003 2 0.4419681 1.765120 1.041042 1.698956 1.0034991 0.8513948 -1 0 0 0 0 0 -1 0 0 histone-lysine n-methyltransferase nsd2
Gene_000004 1 47.5222800 33.976130 33.823440 42.246010 51.9250600 46.6207000 1 1 0 0 0 0 0 -1 0 hypothetical protein pbra 009496
Gene_000006 2 0.4632453 2.305171 1.047562 1.402427 0.6600896 0.6161691 -1 0 0 0 0 0 0 1 0 cilia- and flagella-associated 57-like


11.2.2 Plots

A boxplot is produced to visualize the log scaled expression of all the genes in each cluster and each experimental condition:

Boxplot

Figure 14: Boxplot

This is also represented as a heatmap, which can more easily enable to see gene proportion in each cluster and each experimental condition:

Heatmap

Figure 15: Heatmap

Coseq package produces probability graphs that are concatenated in one graph in ClustAndGO function. On the left you can see if clusters are robusts (if the gene is affiliated to the same cluster for more than 80% of the iterations it appears in black and its affiliation is supposed to be robust - the affiliation of red genes is less robust). On the right, you can see the number (and so evaluate the proportion) of robust and non-robust genes in each cluster:

Probabilities

Figure 16: Probabilities

11.2.3 Per-cluster outputs

In the directory of the clustering analysis, a supplementary directory is created for each cluster in which you can find tables, plots, and a sub-directory which contains the lists of the genes for each GO-term enriched in the cluster.

11.2.3.1 Scaled-expression boxplot

First, for each cluster, log scaled expression is represented for the genes of the cluster.
Probabilities

Figure 17: Probabilities

11.2.3.2 UpSet plot of DEGs in contrasts

To identify in each cluster the DE genes specific to each contrast or common to several, the intersections between DE gene lists are represented with the UpSetR package.

Probabilities

Figure 18: Probabilities

11.2.3.3 GO-enrichment

If a GO-annotation file is provided by the user, ClustAndGO performs automatically GO-term enrichment analysis for each cluster.

Bargraphs
for each GO category are saved with the information of the enrichment ratio, the number of genes behind the enrichment and the p-value :

GO-enrichmentGO-enrichmentGO-enrichment

Figure 19: GO-enrichment

Enrichment of contrast-specific genes
Since not all genes are DE in all contrasts (clustering performed on the list of DE genes in at least 1 contrast of the experiment), a representation highlighting the number of DE genes in each cluster is created.

The plot shows, for each contrast of interest, the percentage of DE genes in the contrast in the cluster compared to the total number of DE genes in the contrast : “” indicates whether the contrast is enriched in genes of the cluster. To define this significance, a Chi2 test is performed between 1) the (observed) proportion of genes in each contrast belonging to the cluster relative to the total number of genes in the contrast, and 2) the (expected) proportion of total genes in the cluster relative to the total number of DE genes in at least 1 of the contrasts (total number of genes used for clustering). If the Chi2 p-value is less than 0.05, and the observed proportion is greater than the expected proportion in a particular contrast, the contrast is statistically significantly enriched in genes of the cluster (data embedded in the graph: ”” 0.001; ”” 0.01; “*” 0.05).

For example in cluster 3, the following plot shows that more than 750 genes are DE in contrast AC1vsAC3 and account for 40.5% of the total DE genes in contrast AC1vsAC3 (whereas the cluster 3 represents only 16.8% of the total number of total DE genes); this proportion shows that contrast AC1vsAC3 is significantly (“***“) enriched in genes of the cluster 3 (the contrast would not have been significantly enriched with genes of the cluster 3 if 16.8% of its genes had belonged to cluster 3, according to a random distribution of genes in the contrasts).

Be careful: the lower the number of genes (in the contrast and/or in the cluster), the less reliable the Chi2 test is. The interpretation of the significance is then difficult.

Gene enrichment in contrasts

Figure 20: Gene enrichment in contrasts

11.2.4 To go further …

If you want, then you can include genes that are not DE in an additional (artificial) cluster and visualize it with the clusters identified on the DE genes with ClustAndGO function. This will also realize a GO-enrichment on these genes in a “NOT_DE” folder.

NOTE: you have to run this function after creating “clust” object with your ClustAndGO analysis (you can create multiple “clust” objects with different names and run the function on it).

IncludeNonDEgenes_InClustering(data, asko_norm, resDEG, parameters, clust)

It can allow you to visualize the profile of the genes that are not DE and their proportion in the whole experiment (heatmap is also available):

Clustering with non DE genes

Figure 21: Clustering with non DE genes

12 Tips and tricks

12.1 How to perform a GO-enrichment analysis on a specific gene list ?

If you want, you can also perform GO-enrichment on a specific list of genes (list of interest from another study, list of genes hightly expressed, …). In this case, you can use the GOenrichment function with two supplementary parameters: the list of the names of the genes you want to analyze and the name you want to set for this list. This function produces the same outputs than the GOenrichment function used on contrasts in ‘GO Enrichment Analysis’ part.The directory used for the outputs is “DEG_test/GOenrichment/NameOfTheList/”. Here is an example on the 1000 first genes of the list:

list=rownames(resDEG[1:1000,])
GOenrichment(resDEG, data, parameters, list, "First1000genes")

12.2 How to perform a clustering analysis on a specific gene list ?

On the same way, you can perform clustering on a specific list of genes (list of interest from another study, list of genes hightly expressed, …). In this case, you can use the ClustAndGO function with two supplementary parameters: the list of the names of the genes you want to analyze and the name you want to set for this list. This function produces the same outputs than the ClustAndGO function used on all the DE genes in ‘Gene clustering for Coexpression analysis’ part. The directory used for the outputs is “DEG_test/Clustering/NameOfTheList/”.Here is an example on the 1000 first genes of the list:

list=rownames(resDEG[1:1000,])
clust <- ClustAndGO(asko_norm,resDEG,parameters, data, list, "First1000genes")

12.3 How to extract gene informations from a specific gene list ?

Another function have been developed to enable the user to obtain heatmap expression and summary table (with all informations generated during the analysis : name of the genes, normalized expression in CPM in each experimental condition, DE status in contrasts, gene description, and cluster membership).

In addition to the objects “resDEG” and “data”, the function needs a list containing the names of the genes you want to analyze and a name for the list. A “DEG_test/GeneListExplore/NameOfTheList/” directory is created.

First, you can draw a heatmap with scaled expression (CPM) of a specific gene list of interest together with the status (OVER/UNDER-differentially expressed or not). Here is an example with the first 25 genes of the dataset:

list=rownames(resDEG[1:25,])
GeneInfo_OnList(list, resDEG, data, parameters,"First25genes")
Gene information

Figure 22: Gene information

In this heatmap, a hierarchical clustering in performed on the genes (default clustering of the Heatmap() function of the ComplexHeatmap package). But if you had run ClustAndGO analysis and so created “clust” object, you can also sort the genes by clusters by adding “clust” as a supplementary parameter:

list=rownames(resDEG[1:25,])
GeneInfo_OnList(list, resDEG, data, parameters, "First25genes_WithClust", clust)
Gene information with clustering

Figure 23: Gene information with clustering

In this heatmap, a hierarchical clustering in performed on the genes (default clustering of the Heatmap() function of the ComplexHeatmap package). But if you had run ClustAndGO analysis and so created “clust” object, you can also sort the genes by clusters by adding “clust” as a supplementary parameter:

conditionsToDraw = c("AC1", "AC2", "AC3")
contrastToDraw = c("AC1vsAC2","AC1vsAC3","AC2vsAC3")
GeneInfo_OnList(list, resDEG, data, parameters, "First25genes_WithClust", clustering=clust, conditions=conditionsToDraw, contrasts=contrastToDraw)
Subset gene information with clustering

Figure 24: Subset gene information with clustering